-
-
Notifications
You must be signed in to change notification settings - Fork 3.1k
Use more compact cache representation for int and str #19750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This comment has been minimized.
This comment has been minimized.
JukkaL
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice -- did you check how much this reduces the size of binary cache files?
| write_int(b, 2 ** 85) | ||
| write_int(b, 255) | ||
| write_int(b, -1) | ||
| write_int(b, -255) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test also the edge cases (-11, -10, -9, 116, 117, 118). Test a few more different lengths of integers (e.g. 15 bits, 23 bits, 30 bits) with arbitrary lower bits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think it would also make sense to test something like len(data.getvalue()) == 1 etc.
It looks like it's ~40% smaller now, for example:
|
|
According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅ |
After looking more at some real data I found that:
ints are between -10 and 117. Values are a bit arbitrary TBH, the idea is that we should include small negative values (forTypeVarIds) and still be able to fit them in 1 byte.Note there are very few integers that would fit in two bytes currently. This is because we only store line for type alias nodes, and type aliases are usually defined at the top of a module. We can add special case for two bytes later when needed.
We could probably save another byte for long strings and medium integers, but I don't want to have anything fancy that would only affect less than 0.1% cases.
Finally you may notice I add a small correctness change I noticed accidentally when working on this, it is not really related, but it is so minor that it doesn't deserve a separate PR.